comprehension dataset
An Information-Theoretic Approach to Analyze NLP Classification Tasks
Wang, Luran, Gales, Mark, Raina, Vatsal
Understanding the importance of the inputs on the output is useful across many tasks. This work provides an information-theoretic framework to analyse the influence of inputs for text classification tasks. Natural language processing (NLP) tasks take either a single element input or multiple element inputs to predict an output variable, where an element is a block of text. Each text element has two components: an associated semantic meaning and a linguistic realization. Multiple-choice reading comprehension (MCRC) and sentiment classification (SC) are selected to showcase the framework. For MCRC, it is found that the context influence on the output compared to the question influence reduces on more challenging datasets. In particular, more challenging contexts allow a greater variation in complexity of questions. Hence, test creators need to carefully consider the choice of the context when designing multiple-choice questions for assessment. For SC, it is found the semantic meaning of the input text dominates (above 80\% for all datasets considered) compared to its linguistic realisation when determining the sentiment. The framework is made available at: https://github.com/WangLuran/nlp-element-influence
Assessing Distractors in Multiple-Choice Tests
Raina, Vatsal, Liusie, Adian, Gales, Mark
Multiple-choice tests are a common approach for assessing candidates' comprehension skills. Standard multiple-choice reading comprehension exams require candidates to select the correct answer option from a discrete set based on a question in relation to a contextual passage. For appropriate assessment, the distractor answer options must by definition be incorrect but plausible and diverse. However, generating good quality distractors satisfying these criteria is a challenging task for content creators. We propose automated assessment metrics for the quality of distractors in multiple-choice reading comprehension tests. Specifically, we define quality in terms of the incorrectness, plausibility and diversity of the distractor options. We assess incorrectness using the classification ability of a binary multiple-choice reading comprehension system. Plausibility is assessed by considering the distractor confidence - the probability mass associated with the distractor options for a standard multi-class multiple-choice reading comprehension system. Diversity is assessed by pairwise comparison of an embedding-based equivalence metric between the distractors of a question. To further validate the plausibility metric we compare against candidate distributions over multiple-choice questions and agreement with a ChatGPT model's interpretation of distractor plausibility and diversity.
Benchmarks for Pir\'a 2.0, a Reading Comprehension Dataset about the Ocean, the Brazilian Coast, and Climate Change
Pirozelli, Paulo, José, Marcos M., Silveira, Igor, Nakasato, Flávio, Peres, Sarajane M., Brandão, Anarosa A. F., Costa, Anna H. R., Cozman, Fabio G.
Pir\'a is a reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change, built from a collection of scientific abstracts and reports on these topics. This dataset represents a versatile language resource, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge. Despite its potential, a detailed set of baselines has not yet been developed for Pir\'a. By creating these baselines, researchers can more easily utilize Pir\'a as a resource for testing machine learning models across a wide range of question answering tasks. In this paper, we define six benchmarks over the Pir\'a dataset, covering closed generative question answering, machine reading comprehension, information retrieval, open question answering, answer triggering, and multiple choice question answering. As part of this effort, we have also produced a curated version of the original dataset, where we fixed a number of grammar issues, repetitions, and other shortcomings. Furthermore, the dataset has been extended in several new directions, so as to face the aforementioned benchmarks: translation of supporting texts from English into Portuguese, classification labels for answerability, automatic paraphrases of questions and answers, and multiple choice candidates. The results described in this paper provide several points of reference for researchers interested in exploring the challenges provided by the Pir\'a dataset.
YORC: Yoruba Reading Comprehension dataset
Aremu, Anuoluwapo, Alabi, Jesujoba O., Adelani, David Ifeoluwa
In this paper, we create YORC: a new multi-choice Yoruba Reading Comprehension dataset that is based on Yoruba high-school reading comprehension examination. We provide baseline results by performing cross-lingual transfer using existing English RACE dataset based on a pre-trained encoder-only model. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.
Analyzing Multiple-Choice Reading and Listening Comprehension Tests
Raina, Vatsal, Liusie, Adian, Gales, Mark
Multiple-choice reading and listening comprehension tests are an important part of language assessment. Content creators for standard educational tests need to carefully curate questions that assess the comprehension abilities of candidates taking the tests. However, recent work has shown that a large number of questions in general multiple-choice reading comprehension datasets can be answered without comprehension, by leveraging world knowledge instead. This work investigates how much of a contextual passage needs to be read in multiple-choice reading based on conversation transcriptions and listening comprehension tests to be able to work out the correct answer. We find that automated reading comprehension systems can perform significantly better than random with partial or even no access to the context passage. These findings offer an approach for content creators to automatically capture the trade-off between comprehension and world knowledge required for their proposed questions.
A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets
Bojic, Iva, Halim, Josef, Suharman, Verena, Tar, Sreeja, Ong, Qi Chwen, Phung, Duy, Ravaut, Mathieu, Joty, Shafiq, Car, Josip
Low-quality data can cause downstream problems in high-stakes applications. Data-centric approach emphasizes on improving dataset quality to enhance model performance. High-quality datasets are needed for general-purpose Large Language Models (LLMs) training, as well as for domain-specific models, which are usually small in size as it is costly to engage a large number of domain experts for their creation. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets. We applied the proposed framework to four biomedical datasets and showed relative improvement of up to 33%/40% for fine-tuning of retrieval/reader models on the BioASQ dataset when using back translation to enhance the original dataset quality.
A Multilingual Modeling Method for Span-Extraction Reading Comprehension
Wu, Gaochen, Xu, Bin, Chang, Dejie, Liu, Bangchang
Span-extraction reading comprehension models have made tremendous advances enabled by the availability of large-scale, high-quality training datasets. Despite such rapid progress and widespread application, extractive reading comprehension datasets in languages other than English remain scarce, and creating such a sufficient amount of training data for each language is costly and even impossible. An alternative to creating large-scale high-quality monolingual span-extraction training datasets is to develop multilingual modeling approaches and systems which can transfer to the target language without requiring training data in that language. In this paper, in order to solve the scarce availability of extractive reading comprehension training data in the target language, we propose a multilingual extractive reading comprehension approach called XLRC by simultaneously modeling the existing extractive reading comprehension training data in a multilingual environment using self-adaptive attention and multilingual attention. Specifically, we firstly construct multilingual parallel corpora by translating the existing extractive reading comprehension datasets (i.e., CMRC 2018) from the target language (i.e., Chinese) into different language families (i.e., English). Secondly, to enhance the final target representation, we adopt self-adaptive attention (SAA) to combine self-attention and inter-attention to extract the semantic relations from each pair of the target and source languages. Furthermore, we propose multilingual attention (MLA) to learn the rich knowledge from various language families. Experimental results show that our model outperforms the state-of-the-art baseline (i.e., RoBERTa_Large) on the CMRC 2018 task, which demonstrate the effectiveness of our proposed multi-lingual modeling approach and show the potentials in multilingual NLP tasks.
R3: A Reading Comprehension Benchmark Requiring Reasoning Processes
Wang, Ran, Tao, Kun, Song, Dingjie, Zhang, Zhilong, Ma, Xiao, Su, Xi'ao, Dai, Xinyu
Existing question answering systems can only predict answers without explicit reasoning processes, which hinder their explainability and make us overestimate their ability of understanding and reasoning over natural language. In this work, we propose a novel task of reading comprehension, in which a model is required to provide final answers and reasoning processes. To this end, we introduce a formalism for reasoning over unstructured text, namely Text Reasoning Meaning Representation (TRMR). TRMR consists of three phrases, which is expressive enough to characterize the reasoning process to answer reading comprehension questions. We develop an annotation platform to facilitate TRMR's annotation, and release the R3 dataset, a \textbf{R}eading comprehension benchmark \textbf{R}equiring \textbf{R}easoning processes. R3 contains over 60K pairs of question-answer pairs and their TRMRs. Our dataset is available at: \url{http://anonymous}.
Large-scale Cloze Test Dataset Created by Teachers
Xie, Qizhe, Lai, Guokun, Dai, Zihang, Hovy, Eduard
Cloze tests are widely adopted in language exams to evaluate students' language proficiency. In this paper, we propose the first large-scale human-created cloze test dataset CLOTH, containing questions used in middle-school and high-school language exams. With missing blanks carefully created by teachers and candidate choices purposely designed to be nuanced, CLOTH requires a deeper language understanding and a wider attention span than previously automatically-generated cloze datasets. We test the performance of dedicatedly designed baseline models including a language model trained on the One Billion Word Corpus and show humans outperform them by a significant margin. We investigate the source of the performance gap, trace model deficiencies to some distinct properties of CLOTH, and identify the limited ability of comprehending the long-term context to be the key bottleneck.
QuAC : Question Answering in Context
Choi, Eunsol, He, He, Iyyer, Mohit, Yatskar, Mark, Yih, Wen-tau, Choi, Yejin, Liang, Percy, Zettlemoyer, Luke
We present QuAC, a dataset for Question Answering in Context that contains 14K information-seeking QA dialogs (100K questions in total). The dialogs involve two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation. We also report results for a number of reference models, including a recently state-of-the-art reading comprehension architecture extended to model dialog context. Our best model underperforms humans by 20 F1, suggesting that there is significant room for future work on this data. Dataset, baseline, and leaderboard available at http://quac.ai.